4/23/24
Commonly denoted as “BLR”, it is a conditional data analytical method for understanding the relationship between independent and dependent variables based on Bayes Theorem while handling uncertainties with accuracy and interpretability[@walker2007application].
Regression analysis is a statistical method that allows to examine the relationship between two or more variables of interest. Suppose the response variable \(Y\) is a function of several predictor variables, \(x_{i1}, x_{i2},..., x_{ik}\).
To fit a model:
\[ Y = X \beta + \epsilon \quad(1) \] where:
The likelihood function of \(y\) given \(\beta\) and \(\sigma^2\) is defined as follows
\[ f(Y_i = y_i|x, \beta, \sigma^2) = f_y(y|x,\beta,\sigma^2) = (2\pi\sigma^2)^{-\frac{n}{2}} exp^{-\frac{1}{2\sigma^2}(y-x\beta)^T(y-x\beta)} \quad(2) \] which describes the probability of observing the data \(y\) given the predictors \(x\) and the parameters \(\beta\) and \(\sigma^2\).
To make an inference, the \(\beta\) parameters are determined through maximizing the likelihood function.
BLR is approached through probability distributions. Instead of estimating the response variable, y, as a single value, it’s drawn from a probability distribution. When using BLR, the model assumes that the response follows a normal distribution, expressed as
\[ y \sim N(\beta^TX, \sigma^2I) \quad(3) \]
where, y is assumed as a normal distribution centered by its mean and variance.
Additionally, the parameters in the model are generated from a probability distribution. The posterior probability of these parameters is proportional to the prior and likelihood [@minka2000bayesian].
\[ P(\beta|y, X) = \frac{P(y|\beta, X) × P(\beta|X)}{P(y|X)} \quad(4) \]
In Bayesian statistics, when the prior distribution and the posterior distribution belong to the same family of probability distributions, they are called conjugate distributions. Using Bayes’ theorem, we have the joint posterior distribution as follows:
\[ f(\beta, \sigma^2|y,x) = \frac{f_y(y|x, \beta,\sigma^2)f(\beta|\sigma^2,m,V)f(\sigma^2|a,b)}{f_y(y)} \quad(5) \]
where:
\(f_y(y|x, \beta,\sigma^2)\) is the likelihood function,
\(f(\beta|\sigma^2,m,V)\) is the prior distribution for coefficients,
\(f(\sigma^2|a,b)\) is the prior distribution for the error variance, and
\(f_y(y)\) is the marginal likelihood.
A non-conjugate prior refers to a prior distribution that does not belong to the same family of distributions as the posterior distribution after observing data. In such cases, the posterior distribution can be expressed as
\[ f(\beta|y, X, \sigma^2) ∝ f_y(y|X, \beta)\prod_{i=1}^{n}f(\beta_i|m_i,v_i) \quad(6) \] where:
\(f(\beta|y, X, \sigma^2)\) is the posterior distribution of coefficients,
\(f_y(y|X, \beta)\) is the likelihood function, and
\(f(\beta_i|m_i,v_i)\) is the prior distribution of each coefficient.
The selected Life Expectancy dataset comprises of 21 health, economic, and social factors taken from 179 countries between years 2000 and 2015.
# Read CSV file directly from a website URL
url = "C:/Users/shiji/OneDrive/桌面/STA6257_capstone/Life-Expectancy-Data-Updated.csv"
ledata = read.csv(url, header = TRUE)
names(ledata) <- c("Country", 'Region',"Year", "InfDth","UnFiveDths","AdultMort", "Alcohol","HepB","Measles","BMI","Polio", "Diphtheria", "HIV", "GDP", "population","ThinAd","ThinChild","Schooling",'Developing','Developed', "LifeExp")
str(ledata)'data.frame': 2864 obs. of 21 variables:
$ Country : chr "Turkiye" "Spain" "India" "Guyana" ...
$ Region : chr "Middle East" "European Union" "Asia" "South America" ...
$ Year : int 2015 2015 2007 2006 2012 2006 2015 2000 2001 2008 ...
$ InfDth : num 11.1 2.7 51.5 32.8 3.4 9.8 6.6 8.7 22 15.3 ...
$ UnFiveDths: num 13 3.3 67.9 40.5 4.3 11.2 8.2 10.1 26.1 17.8 ...
$ AdultMort : num 105.8 57.9 201.1 222.2 58 ...
$ Alcohol : num 1.32 10.35 1.57 5.68 2.89 ...
$ HepB : int 97 97 60 93 97 88 97 88 97 97 ...
$ Measles : int 65 94 35 74 89 86 97 99 87 92 ...
$ BMI : num 27.8 26 21.2 25.3 27 26.4 26.2 25.9 27.9 26.5 ...
$ Polio : int 97 97 67 92 94 89 97 99 97 96 ...
$ Diphtheria: int 97 97 64 93 94 89 97 99 99 90 ...
$ HIV : num 0.08 0.09 0.13 0.79 0.08 0.16 0.08 0.08 0.13 0.43 ...
$ GDP : int 11006 25742 1076 4146 33995 9110 9313 8971 3708 2235 ...
$ population: num 78.53 46.44 1183.21 0.75 7.91 ...
$ ThinAd : num 4.9 0.6 27.1 5.7 1.2 2 2.3 2.3 4 2.9 ...
$ ThinChild : num 4.8 0.5 28 5.5 1.1 1.9 2.3 2.3 3.9 3.1 ...
$ Schooling : num 7.8 9.7 5 7.9 12.8 7.9 12 10.2 9.6 10.9 ...
$ Developing: int 0 1 0 0 1 0 0 1 0 0 ...
$ Developed : int 1 0 1 1 0 1 1 0 1 1 ...
$ LifeExp : num 76.5 82.8 65.4 67 81.7 78.2 71.2 71.2 71.9 68.7 ...
Let’s look at the distribution of life expectancy and the discrepancies of life expectancy in all regions over the years.
# Distribution of Life Expectancy (years)
hist(ledata$LifeExp,
main = "Distribution of Life Expectancy (years)",
xlab = "Life Expectancy",
border = "white",
col = "lightblue",
ylim = c(0, 800))
# Create a line plot for max, mean, and min life expectancy in all regions over the years
life_expectancy_summary <- ledata %>%
group_by(Year, Region) %>%
summarise(
mean_life_expectancy = mean(LifeExp) )
ggplot(life_expectancy_summary, aes(x = Year)) +
geom_line(aes(y = mean_life_expectancy, color = Region), linetype = "solid") +
labs(title = "Regional Life Expectancy by Year (years)", x = "Year", y = "Life Expectancy", color = "Region") +
theme_minimal() +
scale_color_brewer(palette = "Set1")
# Life Expectancy by Economy Status and Region
ggplot(ledata, aes(x = factor(Developing), y = LifeExp, fill = factor(Developed))) +
geom_bar(stat = "summary", fun = "mean", position = "dodge", color = "black", width = 0.7) +
labs(title = "Life Expectancy by Economy Status and Region", x = "Economy Status", y = "Mean Life Expectancy") +
scale_fill_manual(values = c("skyblue", "salmon"), labels = c("Developing", "Developed")) +
theme_minimal() +
theme(legend.title = element_blank()) +
facet_wrap(~ Region)What is the impact of schooling on the lifespan of humans?
How does Infant and Adult mortality rates affect life expectancy?
Let’s continue visualizing data. :)
Does Life Expectancy has positive or negative correlation with Diphtheria Immunization coverage, GDP, Body Mass Index etc.?
Finally, let’s see how much these predictors are related to lifespan and the intercorrelations among two or more predictors.
InfDth UnFiveDths AdultMort Alcohol HepB Measles BMI
44.329144 44.469991 7.618888 1.939630 2.560952 1.587422 2.627136
Polio Diphtheria HIV GDP population ThinAd ThinChild
11.990081 12.940367 2.890280 2.016715 1.150097 8.949993 8.936053
Schooling
4.273524
Given the results of correlation and multicollinearity checking, we select the independent variables of GDP, BMI, alcohol, schooling and HIV and dependent variable of life expectancy to contribute to our final data set used for BLR analysis.
Create a BLR model
Evaluate the performance of BLR model
Compare with Frenquentist method
Our dataset comprises 2,864 observations, consisting of five predictors and one response variable.
In this analysis, we employ the stan_glm() function from the rstan package to establish a Bayesian Linear Regression model.
SAMPLING FOR MODEL 'continuous' NOW (CHAIN 1).
Chain 1:
Chain 1: Gradient evaluation took 0.002192 seconds
Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 21.92 seconds.
Chain 1: Adjust your expectations accordingly!
Chain 1:
Chain 1:
Chain 1: Iteration: 1 / 2000 [ 0%] (Warmup)
Chain 1: Iteration: 200 / 2000 [ 10%] (Warmup)
Chain 1: Iteration: 400 / 2000 [ 20%] (Warmup)
Chain 1: Iteration: 600 / 2000 [ 30%] (Warmup)
Chain 1: Iteration: 800 / 2000 [ 40%] (Warmup)
Chain 1: Iteration: 1000 / 2000 [ 50%] (Warmup)
Chain 1: Iteration: 1001 / 2000 [ 50%] (Sampling)
Chain 1: Iteration: 1200 / 2000 [ 60%] (Sampling)
Chain 1: Iteration: 1400 / 2000 [ 70%] (Sampling)
Chain 1: Iteration: 1600 / 2000 [ 80%] (Sampling)
Chain 1: Iteration: 1800 / 2000 [ 90%] (Sampling)
Chain 1: Iteration: 2000 / 2000 [100%] (Sampling)
Chain 1:
Chain 1: Elapsed Time: 0.332 seconds (Warm-up)
Chain 1: 0.74 seconds (Sampling)
Chain 1: 1.072 seconds (Total)
Chain 1:
SAMPLING FOR MODEL 'continuous' NOW (CHAIN 2).
Chain 2:
Chain 2: Gradient evaluation took 3.7e-05 seconds
Chain 2: 1000 transitions using 10 leapfrog steps per transition would take 0.37 seconds.
Chain 2: Adjust your expectations accordingly!
Chain 2:
Chain 2:
Chain 2: Iteration: 1 / 2000 [ 0%] (Warmup)
Chain 2: Iteration: 200 / 2000 [ 10%] (Warmup)
Chain 2: Iteration: 400 / 2000 [ 20%] (Warmup)
Chain 2: Iteration: 600 / 2000 [ 30%] (Warmup)
Chain 2: Iteration: 800 / 2000 [ 40%] (Warmup)
Chain 2: Iteration: 1000 / 2000 [ 50%] (Warmup)
Chain 2: Iteration: 1001 / 2000 [ 50%] (Sampling)
Chain 2: Iteration: 1200 / 2000 [ 60%] (Sampling)
Chain 2: Iteration: 1400 / 2000 [ 70%] (Sampling)
Chain 2: Iteration: 1600 / 2000 [ 80%] (Sampling)
Chain 2: Iteration: 1800 / 2000 [ 90%] (Sampling)
Chain 2: Iteration: 2000 / 2000 [100%] (Sampling)
Chain 2:
Chain 2: Elapsed Time: 0.267 seconds (Warm-up)
Chain 2: 0.699 seconds (Sampling)
Chain 2: 0.966 seconds (Total)
Chain 2:
SAMPLING FOR MODEL 'continuous' NOW (CHAIN 3).
Chain 3:
Chain 3: Gradient evaluation took 3.7e-05 seconds
Chain 3: 1000 transitions using 10 leapfrog steps per transition would take 0.37 seconds.
Chain 3: Adjust your expectations accordingly!
Chain 3:
Chain 3:
Chain 3: Iteration: 1 / 2000 [ 0%] (Warmup)
Chain 3: Iteration: 200 / 2000 [ 10%] (Warmup)
Chain 3: Iteration: 400 / 2000 [ 20%] (Warmup)
Chain 3: Iteration: 600 / 2000 [ 30%] (Warmup)
Chain 3: Iteration: 800 / 2000 [ 40%] (Warmup)
Chain 3: Iteration: 1000 / 2000 [ 50%] (Warmup)
Chain 3: Iteration: 1001 / 2000 [ 50%] (Sampling)
Chain 3: Iteration: 1200 / 2000 [ 60%] (Sampling)
Chain 3: Iteration: 1400 / 2000 [ 70%] (Sampling)
Chain 3: Iteration: 1600 / 2000 [ 80%] (Sampling)
Chain 3: Iteration: 1800 / 2000 [ 90%] (Sampling)
Chain 3: Iteration: 2000 / 2000 [100%] (Sampling)
Chain 3:
Chain 3: Elapsed Time: 0.425 seconds (Warm-up)
Chain 3: 0.702 seconds (Sampling)
Chain 3: 1.127 seconds (Total)
Chain 3:
SAMPLING FOR MODEL 'continuous' NOW (CHAIN 4).
Chain 4:
Chain 4: Gradient evaluation took 4.9e-05 seconds
Chain 4: 1000 transitions using 10 leapfrog steps per transition would take 0.49 seconds.
Chain 4: Adjust your expectations accordingly!
Chain 4:
Chain 4:
Chain 4: Iteration: 1 / 2000 [ 0%] (Warmup)
Chain 4: Iteration: 200 / 2000 [ 10%] (Warmup)
Chain 4: Iteration: 400 / 2000 [ 20%] (Warmup)
Chain 4: Iteration: 600 / 2000 [ 30%] (Warmup)
Chain 4: Iteration: 800 / 2000 [ 40%] (Warmup)
Chain 4: Iteration: 1000 / 2000 [ 50%] (Warmup)
Chain 4: Iteration: 1001 / 2000 [ 50%] (Sampling)
Chain 4: Iteration: 1200 / 2000 [ 60%] (Sampling)
Chain 4: Iteration: 1400 / 2000 [ 70%] (Sampling)
Chain 4: Iteration: 1600 / 2000 [ 80%] (Sampling)
Chain 4: Iteration: 1800 / 2000 [ 90%] (Sampling)
Chain 4: Iteration: 2000 / 2000 [100%] (Sampling)
Chain 4:
Chain 4: Elapsed Time: 0.443 seconds (Warm-up)
Chain 4: 0.68 seconds (Sampling)
Chain 4: 1.123 seconds (Total)
Chain 4:
stan_glm
family: gaussian [identity]
formula: LifeExp ~ .
observations: 2864
predictors: 6
------
Median MAD_SD
(Intercept) 37.5 1.1
GDP 0.0 0.0
BMI 0.9 0.1
Alcohol 0.0 0.0
Schooling 1.2 0.0
HIV -1.6 0.0
Auxiliary parameter(s):
Median MAD_SD
sigma 4.6 0.1
------
* For help interpreting the printed output see ?print.stanreg
* For info on the priors used see ?prior_summary.stanreg
Upon reviewing the summary, we discern parameter estimates for the model. Notably, the model’s equation can be represented as follows:
\[LifeExp =\]
\[37.5 + 0.9×BMI + 1.2×Schooling - 1.6×HIV \]
The stan_trace() was used to visualize the traces of parameters estimated by Markov chain Monte Carlo (MCMC) sampling.
Each trace plot to each variable appears centered around one value and all chains are well overlapping around one value.
Model Info:
function: stan_glm
family: gaussian [identity]
formula: LifeExp ~ .
algorithm: sampling
sample: 4000 (posterior sample size)
priors: see help('prior_summary')
observations: 2864
predictors: 6
Estimates:
mean sd 10% 50% 90%
(Intercept) 37.5 1.1 36.1 37.5 39.0
GDP 0.0 0.0 0.0 0.0 0.0
BMI 0.9 0.1 0.8 0.9 1.0
Alcohol 0.0 0.0 -0.1 0.0 0.0
Schooling 1.2 0.0 1.1 1.2 1.3
HIV -1.6 0.0 -1.6 -1.6 -1.5
sigma 4.6 0.1 4.5 4.6 4.7
Fit Diagnostics:
mean sd 10% 50% 90%
mean_PPD 68.9 0.1 68.7 68.9 69.0
The mean_ppd is the sample average posterior predictive distribution of the outcome variable (for details see help('summary.stanreg')).
MCMC diagnostics
mcse Rhat n_eff
(Intercept) 0.0 1.0 3758
GDP 0.0 1.0 3832
BMI 0.0 1.0 3561
Alcohol 0.0 1.0 3902
Schooling 0.0 1.0 3069
HIV 0.0 1.0 4809
sigma 0.0 1.0 3563
mean_PPD 0.0 1.0 3787
log-posterior 0.0 1.0 1849
For each parameter, mcse is Monte Carlo standard error, n_eff is a crude measure of effective sample size, and Rhat is the potential scale reduction factor on split chains (at convergence Rhat=1).
\(\hat{R}\) compares the variability in sampled parameter estimates across all chains combined with the variability within each individual change.
\(R = 1\) shows stability across chains.
The dark blue line shows the observed data while the light blue lines are simulations from the posterior predictive distribution.
Although the peaks of both the observed data distribution and the posterior distribution align along the x-axis, there’s a notable discrepancy in the density values along the y-axis.
The histograms exhibit the distribution of posterior samples obtained from the model.
In order to calculate the mean and median values of each parameter, we extracted posterior samples from the model and converted to a data frame .
Mean calculations
(Intercept) Alcohol GDP BMI Schooling HIV
37.54522 -0.04726 0.00012 0.89498 1.19895 -1.58586
sigma
4.58714
Median calculations
(Intercept) Alcohol GDP BMI Schooling HIV
37.52769 -0.04706 0.00012 0.89646 1.19960 -1.58576
sigma
4.58648
We analyzed how the observed data has changed the parameter estimates.
Priors for model 'glm_post1'
------
Intercept (after predictors centered)
Specified prior:
~ normal(location = 69, scale = 2.5)
Adjusted prior:
~ normal(location = 69, scale = 24)
Coefficients
Specified prior:
~ normal(location = [0,0,0,...], scale = [2.5,2.5,2.5,...])
Adjusted prior:
~ normal(location = [0,0,0,...], scale = [ 0.0014,10.7179, 5.9052,...])
Auxiliary (sigma)
Specified prior:
~ exponential(rate = 1)
Adjusted prior:
~ exponential(rate = 0.11)
------
See help('prior_summary.stanreg') for more details
Resulting Model \(LifeExp = 37.5 + 0.9×BMI + 1.2×Schooling + -1.6×HIV\).
Call:
glm(formula = LifeExp ~ ., family = gaussian, data = BLR_data)
Deviance Residuals:
Min 1Q Median 3Q Max
-17.9434 -3.2096 0.1765 3.5416 16.8603
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.752e+01 1.137e+00 33.006 <2e-16 ***
GDP 1.216e-04 6.297e-06 19.319 <2e-16 ***
BMI 8.964e-01 5.143e-02 17.430 <2e-16 ***
Alcohol -4.680e-02 2.818e-02 -1.661 0.0968 .
Schooling 1.198e+00 4.699e-02 25.488 <2e-16 ***
HIV -1.586e+00 3.713e-02 -42.725 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for gaussian family taken to be 21.01963)
Null deviance: 253277 on 2863 degrees of freedom
Residual deviance: 60074 on 2858 degrees of freedom
AIC: 16858
Number of Fisher Scoring iterations: 2
The maximum likelihood estimates of the intercept and slopes are nearly identical to the mean values of posterior samples obtained from Bayesian Linear Regression.
Bayesian Linear Regression
(Intercept) Alcohol GDP BMI Schooling HIV
37.54522 -0.04726 0.00012 0.89498 1.19895 -1.58586
sigma
4.58714
Frenquentist method
(Intercept) GDP BMI Alcohol Schooling HIV
37.51720 0.00012 0.89637 -0.04680 1.19761 -1.58617
In this project, we successfully explored the effects of the predictors on Life expectancy and relationships among them. There were some notable observations while conducting our analysis.
There was a significant negative relationship between adult mortality, infants deaths, deaths under five years old, and life expectancy. As the rates of death of adults, infants, and children under 5 years old go up, life expectancy tends to shorten.
The rates of thinness of adolescents and adults are also critical factors affecting life expectancy, and there is a negative relationship existing between them.
Life expectancy varies depending on the disparity in years of schooling across regions. Generally, life expectancy tends to rise with increase of schooling and higher level of economy development since 2000.
Due to the presence of correlation and multicollinearity, we selected only five variables as predictors for conducting Bayesian Linear Regression Analysis.
Our model was successful in predicting life expectancy when considering the variables of GDP, BMI, Alcohol, schooling and HIV.
The convergence of four chains is strong, and it’s centered around a single value, indicating the excellent performance of the model.
Through observing the BLR model, the variables of BMI, Schooling, and HIV have more weight on predicting life expectancy.